{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "logistic_regression.ipynb",
      "version": "0.3.2",
      "views": {},
      "default_view": {},
      "provenance": [],
      "collapsed_sections": [
        "dPpJUV862FYI",
        "i2e3TlyL57Qs",
        "wCugvl0JdWYL",
        "copyright-notice"
      ]
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "metadata": {
        "colab_type": "text",
        "id": "copyright-notice"
      },
      "cell_type": "markdown",
      "source": [
        "#### Copyright 2017 Google LLC."
      ]
    },
    {
      "metadata": {
        "colab_type": "code",
        "id": "copyright-notice2",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        },
        "cellView": "both"
      },
      "cell_type": "code",
      "outputs": [],
      "execution_count": 0,
      "source": [
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "metadata": {
        "id": "g4T-_IsVbweU",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " # Regresi\u00f3n log\u00edstica"
      ]
    },
    {
      "metadata": {
        "id": "LEAHZv4rIYHX",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " **Objetivos de aprendizaje:**\n",
        "  * replantear el predictor del valor mediano de las casas (de los ejercicios anteriores) como un modelo de clasificaci\u00f3n binaria\n",
        "  * comparar la eficacia de la regresi\u00f3n log\u00edstica frente a la regresi\u00f3n lineal para un problema de clasificaci\u00f3n binaria"
      ]
    },
    {
      "metadata": {
        "id": "CnkCZqdIIYHY",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " Al igual que en el ejercicio anterior, trabajamos con el conjunto de datos de viviendas en California, pero esta vez lo convertiremos en un problema de clasificaci\u00f3n binaria al predecir si una manzana es de costo elevado. Por el momento, tambi\u00e9n volveremos a los atributos predeterminados."
      ]
    },
    {
      "metadata": {
        "id": "9pltCyy2K3dd",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ## Planteamiento del problema como clasificaci\u00f3n binaria\n",
        "\n",
        "El objetivo de nuestro conjunto de datos es `median_house_value`, que es un atributo num\u00e9rico (de valor continuo). Podemos crear una etiqueta booleana al aplicar un umbral a este valor continuo.\n",
        "\n",
        "Dados los atributos que describen una manzana, queremos predecir si se trata de una manzana de costo elevado. Para preparar los objetivos para entrenar y evaluar los datos, definimos un umbral de clasificaci\u00f3n del percentil 75\u00b0 para el valor mediano de las casas (un valor de aproximadamente 265,000). Todos los valores de las casas por encima del umbral se etiquetan como `1` y los dem\u00e1s, como `0`."
      ]
    },
    {
      "metadata": {
        "id": "67IJwZX1Vvjt",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ## Preparaci\u00f3n\n",
        "\n",
        "Ejecuta las celdas a continuaci\u00f3n para cargar los datos y preparar los atributos de entrada y objetivos."
      ]
    },
    {
      "metadata": {
        "id": "fOlbcJ4EIYHd",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "from __future__ import print_function\n",
        "\n",
        "import math\n",
        "\n",
        "from IPython import display\n",
        "from matplotlib import cm\n",
        "from matplotlib import gridspec\n",
        "from matplotlib import pyplot as plt\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "from sklearn import metrics\n",
        "import tensorflow as tf\n",
        "from tensorflow.python.data import Dataset\n",
        "\n",
        "tf.logging.set_verbosity(tf.logging.ERROR)\n",
        "pd.options.display.max_rows = 10\n",
        "pd.options.display.float_format = '{:.1f}'.format\n",
        "\n",
        "california_housing_dataframe = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\", sep=\",\")\n",
        "\n",
        "california_housing_dataframe = california_housing_dataframe.reindex(\n",
        "    np.random.permutation(california_housing_dataframe.index))"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "lTB73MNeIYHf",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " Observa que el c\u00f3digo es levemente distinto del de los ejercicios anteriores. En lugar de usar `median_house_value` como objetivo, creamos un nuevo objetivo binario, `median_house_value_is_high`."
      ]
    },
    {
      "metadata": {
        "id": "kPSqspaqIYHg",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "def preprocess_features(california_housing_dataframe):\n",
        "  \"\"\"Prepares input features from California housing data set.\n",
        "\n",
        "  Args:\n",
        "    california_housing_dataframe: A Pandas DataFrame expected to contain data\n",
        "      from the California housing data set.\n",
        "  Returns:\n",
        "    A DataFrame that contains the features to be used for the model, including\n",
        "    synthetic features.\n",
        "  \"\"\"\n",
        "  selected_features = california_housing_dataframe[\n",
        "    [\"latitude\",\n",
        "     \"longitude\",\n",
        "     \"housing_median_age\",\n",
        "     \"total_rooms\",\n",
        "     \"total_bedrooms\",\n",
        "     \"population\",\n",
        "     \"households\",\n",
        "     \"median_income\"]]\n",
        "  processed_features = selected_features.copy()\n",
        "  # Create a synthetic feature.\n",
        "  processed_features[\"rooms_per_person\"] = (\n",
        "    california_housing_dataframe[\"total_rooms\"] /\n",
        "    california_housing_dataframe[\"population\"])\n",
        "  return processed_features\n",
        "\n",
        "def preprocess_targets(california_housing_dataframe):\n",
        "  \"\"\"Prepares target features (i.e., labels) from California housing data set.\n",
        "\n",
        "  Args:\n",
        "    california_housing_dataframe: A Pandas DataFrame expected to contain data\n",
        "      from the California housing data set.\n",
        "  Returns:\n",
        "    A DataFrame that contains the target feature.\n",
        "  \"\"\"\n",
        "  output_targets = pd.DataFrame()\n",
        "  # Create a boolean categorical feature representing whether the\n",
        "  # median_house_value is above a set threshold.\n",
        "  output_targets[\"median_house_value_is_high\"] = (\n",
        "    california_housing_dataframe[\"median_house_value\"] > 265000).astype(float)\n",
        "  return output_targets"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "FwOYWmXqWA6D",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "# Choose the first 12000 (out of 17000) examples for training.\n",
        "training_examples = preprocess_features(california_housing_dataframe.head(12000))\n",
        "training_targets = preprocess_targets(california_housing_dataframe.head(12000))\n",
        "\n",
        "# Choose the last 5000 (out of 17000) examples for validation.\n",
        "validation_examples = preprocess_features(california_housing_dataframe.tail(5000))\n",
        "validation_targets = preprocess_targets(california_housing_dataframe.tail(5000))\n",
        "\n",
        "# Double-check that we've done the right thing.\n",
        "print(\"Training examples summary:\")\n",
        "display.display(training_examples.describe())\n",
        "print(\"Validation examples summary:\")\n",
        "display.display(validation_examples.describe())\n",
        "\n",
        "print(\"Training targets summary:\")\n",
        "display.display(training_targets.describe())\n",
        "print(\"Validation targets summary:\")\n",
        "display.display(validation_targets.describe())"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "uon1LB3A31VN",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ## \u00bfC\u00f3mo se desempe\u00f1ar\u00eda la regresi\u00f3n lineal?\n",
        "Para ver por qu\u00e9 es eficaz la regresi\u00f3n log\u00edstica, primero entrenemos un modelo simple que use regresi\u00f3n lineal. Este modelo usar\u00e1 etiquetas con valores del conjunto `{0, 1}` e intentar\u00e1 predecir un valor continuo que est\u00e9 lo m\u00e1s cerca posible de `0` o `1`. Adem\u00e1s, queremos interpretar el resultado como una probabilidad, de manera que ser\u00eda ideal si el resultado estuviera dentro del rango `(0, 1)`. De este modo, aplicar\u00edamos un umbral de `0.5` para determinar la etiqueta.\n",
        "\n",
        "Ejecuta las celdas a continuaci\u00f3n para entrenar el modelo de regresi\u00f3n lineal con [LinearRegressor](https://www.tensorflow.org/api_docs/python/tf/estimator/LinearRegressor)."
      ]
    },
    {
      "metadata": {
        "id": "smmUYRDtWOV_",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "def construct_feature_columns(input_features):\n",
        "  \"\"\"Construct the TensorFlow Feature Columns.\n",
        "\n",
        "  Args:\n",
        "    input_features: The names of the numerical input features to use.\n",
        "  Returns:\n",
        "    A set of feature columns\n",
        "  \"\"\"\n",
        "  return set([tf.feature_column.numeric_column(my_feature)\n",
        "              for my_feature in input_features])"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "B5OwSrr1yIKD",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):\n",
        "    \"\"\"Trains a linear regression model.\n",
        "  \n",
        "    Args:\n",
        "      features: pandas DataFrame of features\n",
        "      targets: pandas DataFrame of targets\n",
        "      batch_size: Size of batches to be passed to the model\n",
        "      shuffle: True or False. Whether to shuffle the data.\n",
        "      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely\n",
        "    Returns:\n",
        "      Tuple of (features, labels) for next data batch\n",
        "    \"\"\"\n",
        "    \n",
        "    # Convert pandas data into a dict of np arrays.\n",
        "    features = {key:np.array(value) for key,value in dict(features).items()}                                            \n",
        " \n",
        "    # Construct a dataset, and configure batching/repeating.\n",
        "    ds = Dataset.from_tensor_slices((features,targets)) # warning: 2GB limit\n",
        "    ds = ds.batch(batch_size).repeat(num_epochs)\n",
        "    \n",
        "    # Shuffle the data, if specified.\n",
        "    if shuffle:\n",
        "      ds = ds.shuffle(10000)\n",
        "    \n",
        "    # Return the next batch of data.\n",
        "    features, labels = ds.make_one_shot_iterator().get_next()\n",
        "    return features, labels"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "SE2-hq8PIYHz",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "def train_linear_regressor_model(\n",
        "    learning_rate,\n",
        "    steps,\n",
        "    batch_size,\n",
        "    training_examples,\n",
        "    training_targets,\n",
        "    validation_examples,\n",
        "    validation_targets):\n",
        "  \"\"\"Trains a linear regression model.\n",
        "  \n",
        "  In addition to training, this function also prints training progress information,\n",
        "  as well as a plot of the training and validation loss over time.\n",
        "  \n",
        "  Args:\n",
        "    learning_rate: A `float`, the learning rate.\n",
        "    steps: A non-zero `int`, the total number of training steps. A training step\n",
        "      consists of a forward and backward pass using a single batch.\n",
        "    batch_size: A non-zero `int`, the batch size.\n",
        "    training_examples: A `DataFrame` containing one or more columns from\n",
        "      `california_housing_dataframe` to use as input features for training.\n",
        "    training_targets: A `DataFrame` containing exactly one column from\n",
        "      `california_housing_dataframe` to use as target for training.\n",
        "    validation_examples: A `DataFrame` containing one or more columns from\n",
        "      `california_housing_dataframe` to use as input features for validation.\n",
        "    validation_targets: A `DataFrame` containing exactly one column from\n",
        "      `california_housing_dataframe` to use as target for validation.\n",
        "      \n",
        "  Returns:\n",
        "    A `LinearRegressor` object trained on the training data.\n",
        "  \"\"\"\n",
        "\n",
        "  periods = 10\n",
        "  steps_per_period = steps / periods\n",
        "\n",
        "  # Create a linear regressor object.\n",
        "  my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\n",
        "  my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)\n",
        "  linear_regressor = tf.estimator.LinearRegressor(\n",
        "      feature_columns=construct_feature_columns(training_examples),\n",
        "      optimizer=my_optimizer\n",
        "  )\n",
        "    \n",
        "  # Create input functions.  \n",
        "  training_input_fn = lambda: my_input_fn(training_examples, \n",
        "                                          training_targets[\"median_house_value_is_high\"], \n",
        "                                          batch_size=batch_size)\n",
        "  predict_training_input_fn = lambda: my_input_fn(training_examples, \n",
        "                                                  training_targets[\"median_house_value_is_high\"], \n",
        "                                                  num_epochs=1, \n",
        "                                                  shuffle=False)\n",
        "  predict_validation_input_fn = lambda: my_input_fn(validation_examples, \n",
        "                                                    validation_targets[\"median_house_value_is_high\"], \n",
        "                                                    num_epochs=1, \n",
        "                                                    shuffle=False)\n",
        "\n",
        "  # Train the model, but do so inside a loop so that we can periodically assess\n",
        "  # loss metrics.\n",
        "  print(\"Training model...\")\n",
        "  print(\"RMSE (on training data):\")\n",
        "  training_rmse = []\n",
        "  validation_rmse = []\n",
        "  for period in range (0, periods):\n",
        "    # Train the model, starting from the prior state.\n",
        "    linear_regressor.train(\n",
        "        input_fn=training_input_fn,\n",
        "        steps=steps_per_period\n",
        "    )\n",
        "    \n",
        "    # Take a break and compute predictions.\n",
        "    training_predictions = linear_regressor.predict(input_fn=predict_training_input_fn)\n",
        "    training_predictions = np.array([item['predictions'][0] for item in training_predictions])\n",
        "    \n",
        "    validation_predictions = linear_regressor.predict(input_fn=predict_validation_input_fn)\n",
        "    validation_predictions = np.array([item['predictions'][0] for item in validation_predictions])\n",
        "    \n",
        "    # Compute training and validation loss.\n",
        "    training_root_mean_squared_error = math.sqrt(\n",
        "        metrics.mean_squared_error(training_predictions, training_targets))\n",
        "    validation_root_mean_squared_error = math.sqrt(\n",
        "        metrics.mean_squared_error(validation_predictions, validation_targets))\n",
        "    # Occasionally print the current loss.\n",
        "    print(\"  period %02d : %0.2f\" % (period, training_root_mean_squared_error))\n",
        "    # Add the loss metrics from this period to our list.\n",
        "    training_rmse.append(training_root_mean_squared_error)\n",
        "    validation_rmse.append(validation_root_mean_squared_error)\n",
        "  print(\"Model training finished.\")\n",
        "  \n",
        "  # Output a graph of loss metrics over periods.\n",
        "  plt.ylabel(\"RMSE\")\n",
        "  plt.xlabel(\"Periods\")\n",
        "  plt.title(\"Root Mean Squared Error vs. Periods\")\n",
        "  plt.tight_layout()\n",
        "  plt.plot(training_rmse, label=\"training\")\n",
        "  plt.plot(validation_rmse, label=\"validation\")\n",
        "  plt.legend()\n",
        "\n",
        "  return linear_regressor"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "TDBD8xeeIYH2",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "linear_regressor = train_linear_regressor_model(\n",
        "    learning_rate=0.000001,\n",
        "    steps=200,\n",
        "    batch_size=20,\n",
        "    training_examples=training_examples,\n",
        "    training_targets=training_targets,\n",
        "    validation_examples=validation_examples,\n",
        "    validation_targets=validation_targets)"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "JjBZ_q7aD9gh",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ## Tarea\u00a01: \u00bfPodemos calcular la p\u00e9rdida log\u00edstica para estas predicciones?\n",
        "\n",
        "**Examina las predicciones y decide si podemos usarlas o no para calcular la p\u00e9rdida log\u00edstica.**\n",
        "\n",
        "`LinearRegressor` usa la p\u00e9rdida L2, que no se desempe\u00f1a muy bien al penalizar las clasificaciones incorrectas cuando el resultado se interpreta como una probabilidad. Por ejemplo, deber\u00eda haber una diferencia enorme si un ejemplo negativo se clasificara como positivo con una probabilidad de 0.9 frente a 0.9999, pero la p\u00e9rdida L2 no diferencia estos casos de forma taxativa.\n",
        "\n",
        "Por el contrario, `LogLoss` penaliza mucho m\u00e1s estos \"errores de certeza\". Recuerda que `LogLoss` se define de la siguiente manera:\n",
        "\n",
        "$$P\u00e9rdida log\u00edstica = \\sum_{(x,y)\\in D} -y \\cdot log(y_{pred}) - (1 - y) \\cdot log(1 - y_{pred})$$\n",
        "\n",
        "\n",
        "Pero, primero, tendremos que obtener los valores de predicci\u00f3n. Podr\u00edamos usar `LinearRegressor.predict` para obtenerlos.\n",
        "\n",
        "Dadas las predicciones y los objetivos, \u00bfpodemos calcular `LogLoss`?"
      ]
    },
    {
      "metadata": {
        "id": "dPpJUV862FYI",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ### Soluci\u00f3n\n",
        "\n",
        "Haz clic m\u00e1s abajo para que se muestre la soluci\u00f3n."
      ]
    },
    {
      "metadata": {
        "id": "kXFQ5uig2RoP",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "predict_validation_input_fn = lambda: my_input_fn(validation_examples, \n",
        "                                                  validation_targets[\"median_house_value_is_high\"], \n",
        "                                                  num_epochs=1, \n",
        "                                                  shuffle=False)\n",
        "\n",
        "validation_predictions = linear_regressor.predict(input_fn=predict_validation_input_fn)\n",
        "validation_predictions = np.array([item['predictions'][0] for item in validation_predictions])\n",
        "\n",
        "_ = plt.hist(validation_predictions)"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "rYpy336F9wBg",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ## Tarea\u00a02: Entrenar un modelo de regresi\u00f3n log\u00edstica y calcular la p\u00e9rdida log\u00edstica con el conjunto de validaci\u00f3n\n",
        "Para usar regresi\u00f3n log\u00edstica, simplemente usa [LinearClassifier](https://www.tensorflow.org/api_docs/python/tf/estimator/LinearClassifier) en lugar de `LinearRegressor`. Completa el c\u00f3digo a continuaci\u00f3n.\n",
        "**NOTA**: Al ejecutar `train()` y `predict()` en un modelo de `LinearClassifier`, puedes acceder a las probabilidades predichas con valores reales a trav\u00e9s de la clave `\"probabilities\"` en el diccionario devuelto, p.\u00a0ej., `predictions[\"probabilities\"]`. La funci\u00f3n [log_loss](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html) de Sklearn es pr\u00e1ctica para calcular la p\u00e9rdida log\u00edstica con estas probabilidades.\n"
      ]
    },
    {
      "metadata": {
        "id": "JElcb--E9wBm",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "def train_linear_classifier_model(\n",
        "    learning_rate,\n",
        "    steps,\n",
        "    batch_size,\n",
        "    training_examples,\n",
        "    training_targets,\n",
        "    validation_examples,\n",
        "    validation_targets):\n",
        "  \"\"\"Trains a linear classification model.\n",
        "  \n",
        "  In addition to training, this function also prints training progress information,\n",
        "  as well as a plot of the training and validation loss over time.\n",
        "  \n",
        "  Args:\n",
        "    learning_rate: A `float`, the learning rate.\n",
        "    steps: A non-zero `int`, the total number of training steps. A training step\n",
        "      consists of a forward and backward pass using a single batch.\n",
        "    batch_size: A non-zero `int`, the batch size.\n",
        "    training_examples: A `DataFrame` containing one or more columns from\n",
        "      `california_housing_dataframe` to use as input features for training.\n",
        "    training_targets: A `DataFrame` containing exactly one column from\n",
        "      `california_housing_dataframe` to use as target for training.\n",
        "    validation_examples: A `DataFrame` containing one or more columns from\n",
        "      `california_housing_dataframe` to use as input features for validation.\n",
        "    validation_targets: A `DataFrame` containing exactly one column from\n",
        "      `california_housing_dataframe` to use as target for validation.\n",
        "      \n",
        "  Returns:\n",
        "    A `LinearClassifier` object trained on the training data.\n",
        "  \"\"\"\n",
        "\n",
        "  periods = 10\n",
        "  steps_per_period = steps / periods\n",
        "  \n",
        "  # Create a linear classifier object.\n",
        "  my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\n",
        "  my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)\n",
        "  linear_classifier = # YOUR CODE HERE: Construct the linear classifier.\n",
        "  \n",
        "  # Create input functions.\n",
        "  training_input_fn = lambda: my_input_fn(training_examples, \n",
        "                                          training_targets[\"median_house_value_is_high\"], \n",
        "                                          batch_size=batch_size)\n",
        "  predict_training_input_fn = lambda: my_input_fn(training_examples, \n",
        "                                                  training_targets[\"median_house_value_is_high\"], \n",
        "                                                  num_epochs=1, \n",
        "                                                  shuffle=False)\n",
        "  predict_validation_input_fn = lambda: my_input_fn(validation_examples, \n",
        "                                                    validation_targets[\"median_house_value_is_high\"], \n",
        "                                                    num_epochs=1, \n",
        "                                                    shuffle=False)\n",
        "  \n",
        "  # Train the model, but do so inside a loop so that we can periodically assess\n",
        "  # loss metrics.\n",
        "  print(\"Training model...\")\n",
        "  print(\"LogLoss (on training data):\")\n",
        "  training_log_losses = []\n",
        "  validation_log_losses = []\n",
        "  for period in range (0, periods):\n",
        "    # Train the model, starting from the prior state.\n",
        "    linear_classifier.train(\n",
        "        input_fn=training_input_fn,\n",
        "        steps=steps_per_period\n",
        "    )\n",
        "    # Take a break and compute predictions.    \n",
        "    training_probabilities = linear_classifier.predict(input_fn=predict_training_input_fn)\n",
        "    training_probabilities = np.array([item['probabilities'] for item in training_probabilities])\n",
        "    \n",
        "    validation_probabilities = linear_classifier.predict(input_fn=predict_validation_input_fn)\n",
        "    validation_probabilities = np.array([item['probabilities'] for item in validation_probabilities])\n",
        "    \n",
        "    training_log_loss = metrics.log_loss(training_targets, training_probabilities)\n",
        "    validation_log_loss = metrics.log_loss(validation_targets, validation_probabilities)\n",
        "    # Occasionally print the current loss.\n",
        "    print(\"  period %02d : %0.2f\" % (period, training_log_loss))\n",
        "    # Add the loss metrics from this period to our list.\n",
        "    training_log_losses.append(training_log_loss)\n",
        "    validation_log_losses.append(validation_log_loss)\n",
        "  print(\"Model training finished.\")\n",
        "  \n",
        "  # Output a graph of loss metrics over periods.\n",
        "  plt.ylabel(\"LogLoss\")\n",
        "  plt.xlabel(\"Periods\")\n",
        "  plt.title(\"LogLoss vs. Periods\")\n",
        "  plt.tight_layout()\n",
        "  plt.plot(training_log_losses, label=\"training\")\n",
        "  plt.plot(validation_log_losses, label=\"validation\")\n",
        "  plt.legend()\n",
        "\n",
        "  return linear_classifier"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "VM0wmnFUIYH9",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "linear_classifier = train_linear_classifier_model(\n",
        "    learning_rate=0.000005,\n",
        "    steps=500,\n",
        "    batch_size=20,\n",
        "    training_examples=training_examples,\n",
        "    training_targets=training_targets,\n",
        "    validation_examples=validation_examples,\n",
        "    validation_targets=validation_targets)"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "i2e3TlyL57Qs",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ### Soluci\u00f3n\n",
        "Haz clic m\u00e1s abajo para ver la soluci\u00f3n.\n"
      ]
    },
    {
      "metadata": {
        "id": "5YxXd2hn6MuF",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "def train_linear_classifier_model(\n",
        "    learning_rate,\n",
        "    steps,\n",
        "    batch_size,\n",
        "    training_examples,\n",
        "    training_targets,\n",
        "    validation_examples,\n",
        "    validation_targets):\n",
        "  \"\"\"Trains a linear classification model.\n",
        "  \n",
        "  In addition to training, this function also prints training progress information,\n",
        "  as well as a plot of the training and validation loss over time.\n",
        "  \n",
        "  Args:\n",
        "    learning_rate: A `float`, the learning rate.\n",
        "    steps: A non-zero `int`, the total number of training steps. A training step\n",
        "      consists of a forward and backward pass using a single batch.\n",
        "    batch_size: A non-zero `int`, the batch size.\n",
        "    training_examples: A `DataFrame` containing one or more columns from\n",
        "      `california_housing_dataframe` to use as input features for training.\n",
        "    training_targets: A `DataFrame` containing exactly one column from\n",
        "      `california_housing_dataframe` to use as target for training.\n",
        "    validation_examples: A `DataFrame` containing one or more columns from\n",
        "      `california_housing_dataframe` to use as input features for validation.\n",
        "    validation_targets: A `DataFrame` containing exactly one column from\n",
        "      `california_housing_dataframe` to use as target for validation.\n",
        "      \n",
        "  Returns:\n",
        "    A `LinearClassifier` object trained on the training data.\n",
        "  \"\"\"\n",
        "\n",
        "  periods = 10\n",
        "  steps_per_period = steps / periods\n",
        "  \n",
        "  # Create a linear classifier object.\n",
        "  my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)\n",
        "  my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)  \n",
        "  linear_classifier = tf.estimator.LinearClassifier(\n",
        "      feature_columns=construct_feature_columns(training_examples),\n",
        "      optimizer=my_optimizer\n",
        "  )\n",
        "  \n",
        "  # Create input functions.\n",
        "  training_input_fn = lambda: my_input_fn(training_examples, \n",
        "                                          training_targets[\"median_house_value_is_high\"], \n",
        "                                          batch_size=batch_size)\n",
        "  predict_training_input_fn = lambda: my_input_fn(training_examples, \n",
        "                                                  training_targets[\"median_house_value_is_high\"], \n",
        "                                                  num_epochs=1, \n",
        "                                                  shuffle=False)\n",
        "  predict_validation_input_fn = lambda: my_input_fn(validation_examples, \n",
        "                                                    validation_targets[\"median_house_value_is_high\"], \n",
        "                                                    num_epochs=1, \n",
        "                                                    shuffle=False)\n",
        "  \n",
        "  # Train the model, but do so inside a loop so that we can periodically assess\n",
        "  # loss metrics.\n",
        "  print(\"Training model...\")\n",
        "  print(\"LogLoss (on training data):\")\n",
        "  training_log_losses = []\n",
        "  validation_log_losses = []\n",
        "  for period in range (0, periods):\n",
        "    # Train the model, starting from the prior state.\n",
        "    linear_classifier.train(\n",
        "        input_fn=training_input_fn,\n",
        "        steps=steps_per_period\n",
        "    )\n",
        "    # Take a break and compute predictions.    \n",
        "    training_probabilities = linear_classifier.predict(input_fn=predict_training_input_fn)\n",
        "    training_probabilities = np.array([item['probabilities'] for item in training_probabilities])\n",
        "    \n",
        "    validation_probabilities = linear_classifier.predict(input_fn=predict_validation_input_fn)\n",
        "    validation_probabilities = np.array([item['probabilities'] for item in validation_probabilities])\n",
        "    \n",
        "    training_log_loss = metrics.log_loss(training_targets, training_probabilities)\n",
        "    validation_log_loss = metrics.log_loss(validation_targets, validation_probabilities)\n",
        "    # Occasionally print the current loss.\n",
        "    print(\"  period %02d : %0.2f\" % (period, training_log_loss))\n",
        "    # Add the loss metrics from this period to our list.\n",
        "    training_log_losses.append(training_log_loss)\n",
        "    validation_log_losses.append(validation_log_loss)\n",
        "  print(\"Model training finished.\")\n",
        "  \n",
        "  # Output a graph of loss metrics over periods.\n",
        "  plt.ylabel(\"LogLoss\")\n",
        "  plt.xlabel(\"Periods\")\n",
        "  plt.title(\"LogLoss vs. Periods\")\n",
        "  plt.tight_layout()\n",
        "  plt.plot(training_log_losses, label=\"training\")\n",
        "  plt.plot(validation_log_losses, label=\"validation\")\n",
        "  plt.legend()\n",
        "\n",
        "  return linear_classifier"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "UPM_T1FXsTaL",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "linear_classifier = train_linear_classifier_model(\n",
        "    learning_rate=0.000005,\n",
        "    steps=500,\n",
        "    batch_size=20,\n",
        "    training_examples=training_examples,\n",
        "    training_targets=training_targets,\n",
        "    validation_examples=validation_examples,\n",
        "    validation_targets=validation_targets)"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "i-Xo83_aR6s_",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ## Tarea\u00a03: Calcular la exactitud y representar una curva ROC para el conjunto de validaci\u00f3n\n",
        "\n",
        "Algunas de las m\u00e9tricas que resultan \u00fatiles para la clasificaci\u00f3n son la [exactitud](https://es.wikipedia.org/wiki/Precisi%C3%B3n_y_exactitud) del modelo, la [curva ROC](https://es.wikipedia.org/wiki/Curva_ROC) y el \u00e1rea bajo la curva ROC (AUC). Examinaremos estas m\u00e9tricas.\n",
        "\n",
        "`LinearClassifier.evaluate` calcula las m\u00e9tricas \u00fatiles, como la exactitud y el AUC."
      ]
    },
    {
      "metadata": {
        "id": "DKSQ87VVIYIA",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "evaluation_metrics = linear_classifier.evaluate(input_fn=predict_validation_input_fn)\n",
        "\n",
        "print(\"AUC on the validation set: %0.2f\" % evaluation_metrics['auc'])\n",
        "print(\"Accuracy on the validation set: %0.2f\" % evaluation_metrics['accuracy'])"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "47xGS2uNIYIE",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " Para obtener las tasas de verdaderos positivos y falsos positivos que se necesitan para representar una curva ROC, puedes usar probabilidades de clase, como aquellas que se calculan con `LinearClassifier.predict`, y la funci\u00f3n [roc_curve](http://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics) de Sklearn."
      ]
    },
    {
      "metadata": {
        "id": "xaU7ttj8IYIF",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "validation_probabilities = linear_classifier.predict(input_fn=predict_validation_input_fn)\n",
        "# Get just the probabilities for the positive class.\n",
        "validation_probabilities = np.array([item['probabilities'][1] for item in validation_probabilities])\n",
        "\n",
        "false_positive_rate, true_positive_rate, thresholds = metrics.roc_curve(\n",
        "    validation_targets, validation_probabilities)\n",
        "plt.plot(false_positive_rate, true_positive_rate, label=\"our model\")\n",
        "plt.plot([0, 1], [0, 1], label=\"random classifier\")\n",
        "_ = plt.legend(loc=2)"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "PIdhwfgzIYII",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " **Comprueba si puedes ajustar la configuraci\u00f3n de aprendizaje del modelo entrenado en la Tarea\u00a02 para mejorar el AUC.**\n",
        "\n",
        "Con frecuencia, determinadas m\u00e9tricas mejoran en detrimento de otras, y debes encontrar la configuraci\u00f3n que logre un buen equilibrio.\n",
        "\n",
        "**Verifica si todas las m\u00e9tricas mejoran al mismo tiempo.**"
      ]
    },
    {
      "metadata": {
        "id": "XKIqjsqcCaxO",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "# TUNE THE SETTINGS BELOW TO IMPROVE AUC\n",
        "linear_classifier = train_linear_classifier_model(\n",
        "    learning_rate=0.000005,\n",
        "    steps=500,\n",
        "    batch_size=20,\n",
        "    training_examples=training_examples,\n",
        "    training_targets=training_targets,\n",
        "    validation_examples=validation_examples,\n",
        "    validation_targets=validation_targets)\n",
        "\n",
        "evaluation_metrics = linear_classifier.evaluate(input_fn=predict_validation_input_fn)\n",
        "\n",
        "print(\"AUC on the validation set: %0.2f\" % evaluation_metrics['auc'])\n",
        "print(\"Accuracy on the validation set: %0.2f\" % evaluation_metrics['accuracy'])"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "id": "wCugvl0JdWYL",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " ### Soluci\u00f3n\n",
        "\n",
        "Haz clic m\u00e1s abajo para conocer una soluci\u00f3n posible."
      ]
    },
    {
      "metadata": {
        "id": "VHosS1g2aetf",
        "colab_type": "text"
      },
      "cell_type": "markdown",
      "source": [
        " Una soluci\u00f3n posible que funciona es simplemente entrenar por m\u00e1s tiempo, siempre que no realicemos un sobreajuste. \n",
        "\n",
        "Esto se puede lograr al incrementar el n\u00famero de pasos, el tama\u00f1o del lote o ambos.\n",
        "\n",
        "Todas las m\u00e9tricas mejoran a la vez, de manera que nuestra m\u00e9trica de p\u00e9rdida es un buen representante tanto del AUC como de la exactitud.\n",
        "\n",
        "Observa c\u00f3mo hace muchas m\u00e1s iteraciones simplemente para restringir algunas unidades m\u00e1s del AUC. Esto ocurre com\u00fanmente, pero, con frecuencia, incluso esta peque\u00f1a ganancia vale la pena."
      ]
    },
    {
      "metadata": {
        "id": "dWgTEYMddaA-",
        "colab_type": "code",
        "colab": {
          "autoexec": {
            "startup": false,
            "wait_interval": 0
          }
        }
      },
      "source": [
        "linear_classifier = train_linear_classifier_model(\n",
        "    learning_rate=0.000003,\n",
        "    steps=20000,\n",
        "    batch_size=500,\n",
        "    training_examples=training_examples,\n",
        "    training_targets=training_targets,\n",
        "    validation_examples=validation_examples,\n",
        "    validation_targets=validation_targets)\n",
        "\n",
        "evaluation_metrics = linear_classifier.evaluate(input_fn=predict_validation_input_fn)\n",
        "\n",
        "print(\"AUC on the validation set: %0.2f\" % evaluation_metrics['auc'])\n",
        "print(\"Accuracy on the validation set: %0.2f\" % evaluation_metrics['accuracy'])"
      ],
      "cell_type": "code",
      "execution_count": 0,
      "outputs": []
    }
  ]
}
